Method for gene mapping from chromosome and phenotype data

ABSTRACT

The present invention relates to a method for gene mapping from chromosome and phenotype data, which utilizes linkage disequilibrium between genetic markers m i , which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region. The method according to the invention is based on discovering and assessing tree-like patterns in genetic marker data. It extracts, essentially in the form of substrings and prefix trees, information about the historical recombinations in the population. This infor-mation is used to locate fragments potentially inherited from a common diseased founder, and to map the disease gene into the most likely such fragment. The method measures for each chromosomal location the disequilibrium of the prefix tree of marker strings starting from the location, to assess the distribution of disease-associated chromosomes.

FIELD OF THE INVENTION

The present invention relates to a method for gene mapping fromchromosome and phenotype data, which utilizes linkage disequilibriumbetween genetic markers m_(i), which are polymorphic nucleic acid orprotein sequences or strings of single-nucleotide polymorphisms derivingfrom a chromosomal region.

BACKGROUND OF THE INVENTION

Gene mapping aims at discovering a statistical connection from aparticular disease or trait to a narrow region in the genome probablycontaining a gene that affects the trait. In particular, the discoveryof new disease susceptibility genes can have an immense importance forhuman health care. The gene and the proteins it produces can be analyzedto understand the disease causing mechanisms and to design newmedicines. Further, gene tests on patients can be used to assessindividual risks and for preventive and individually tailoredmedications. Obviously, gene mapping is receiving increasing interestamong medical industry.

Genetic markers along chromosomes provide data that can be used todiscover associations between patient phenotypes (e.g., diseased vs.healthy) and chromosomal regions (i.e., potential disease gene loci).The growing number of available genetic markers, anticipated to reachhundreds of thousands in the next few years, offers new opportunitiesbut also amplifies the computational complexity of the task.

Human genome sequencing efforts, the first ones now almost complete,read the full human DNA sequence. There are methods for recognizingwhere there are genes in the sequence—the number of which is currentlyestimated to be around 30,000. However, we lack methods for deriving thefunction of a gene from the sequence information. Gene mappingapproaches this problem for one disease at a time. It aims atdiscovering areas in the genome—hopefully small—that have a statisticalconnection to a given trait, thus narrowing down the area to be analyzedwith expensive laboratory methods.

A typical setting for gene mapping is a case-control study of somechromosome of diseased and healthy individuals. Instead of looking atthe DNA of the whole chromosome, only certain marker segmentsdistributed along the chromosome are considered. By the analysis ofsimilarities within the disease-associated chromosomes on one hand andthe differences between the disease-associated and control chromosomeson the other, one can try to locate likely areas for a gene thatpredisposes people to the disease analyzed.

The overall goal of the method according to the invention is to locate adisease-susceptibility gene for a given disease. In gene mapping, theaim is to identify a narrow chromosomal region within which the gene islikely to be; this area can then be analyzed in more detail withlaboratory tools. We next briefly review the genetic background; withoutloss of generality, we restrict the discussion in this paper to onechromosome.

Marker Data

A genetic marker is a short polymorphic region in the DNA, denoted hereby M1, M2, . . . . The different variants of DNA that different peoplehave at the marker are called alleles, denoted in our examples by 1, 2,3, . . . . The number of alleles per marker is small: typically lessthan ten for microsatellite markers, and exactly two for singlenucleotide polymorphisms (SNP). The collection of markers used in aparticular study is its marker map, and the corresponding alleles in agiven chromosome constitute its haplotype (FIG. 1). It is a major taskof a gene mapping study to design the marker map and to obtain thehaplotype data. That is where we start, and for the purposes of thispaper the input data consists of haplotypes of diseased and controlpersons—or, in computer science terms, aligned allele strings,classified to positive and negative examples.

Linkage Disequilibrium

All the current carriers of a disease-susceptibility gene have inheritedit from a founder who introduced the gene mutation to the population. Ifthere has been only one or few such founders, then many of the currentcarriers are related, may share some segments of the chromosome, andlend themselves to gene mapping studies. In particular, segments fromthe mutation carrying founder chromosomes are over-represented among theaffected at mutation locus. Relatively young (e.g. 1000 years)population isolates are promising sources of data in this respect:disease-susceptibility genes may have been introduced by one or twofounders only, and the gene may be over-represented in the population.Kainuu region in eastern Finland is an example of such a fruitful areafor genetic studies.

If there are conserved regions at the mutation locus, then it can bepossible to observe linkage disequilibrium (LD), or non-randomassociation between nearby markers (FIG. 2). There are severestatistical problems, however, in observing LD. Mutation carriers oftenonly have a higher risk of being diseased than non-carriers, and in acase-control study both groups can be mixes of carriers andnon-carriers. Further on, since the selection of patients is more orless random, and the whole coalescence process leading to LD isstochastic, it is a challenge to recognize LD and the DS gene locationfrom all the noise.

Gene Mapping

In diseases with a reasonable genetic contribution, and especially inpopulation isolates, affected individuals are likely to have higherfrequencies of certain alleles and haplotype patterns near the DS genethan control individuals. This is the starting point of LD-based mappingmethods: where does the set of affected chromosomes show linkagedisequilibrium? The problem is far from trivial, however. Thecoalescence process is stochastic; mutation carriers often only have ahigher risk of being diseased than non-carriers, and in a case-controlstudy both groups are usually mixes of carriers and non-carriers; andfinally, there is missing information and haplotyping ambiguities.

Most current gene mapping methods based on linkage disequilibrium lookjust at individual markers or neighboring markers, measure theirassociation to the disease status, and predict the gene locus to beco-located with the strongest association. However, since differentmutation carriers share different segments, there is no single marker orpattern that is representative of the shared segments.

In the recent years, several statistical methods have been proposed todetect LD (Terwilliger 1995, Devlin et al. 1996, Lazzeroni 1998, Serviceet al. 1999, McPeek et al. 1999). The emphasis has been on fairlyinvolved statistical models of LD around a DS gene. They model wholerecombination histories and some are robust to high levels ofheterogeneity. On the other hand, the models are based on a number ofassumptions about the inheritance model of the disease and the structureof the population, which may be misleading for the statisticalinference. The methods tend to be computationally heavy and thereforebetter suited for fine mapping than genome screening.

Haplotype Pattern Mining or HPM (Toivonen et al. 2000a, Toivonen et al.2000b) is based on analyzing the LD of sets of haplotype patterns,essentially strings with wildcard characters. The method first finds allhaplotype patterns that are strongly associated with the disease status,using ideas similar to the discovery of association rules (Agrawal etal. 1993, Agrawal et al. 1996). Since the patterns may contain gaps theycan account for some missing and erroneous data. In the second step,each marker is ranked by the number of patterns that contain it. Eitherthis score is used as a basis for the prediction or, preferably, apermutation test is used to obtain marker-wise p values. HPM has beenextended for detecting multiple genes simultaneously (Toivonen et al.2000b) and to handle quantitative phenotypes and covariates (Sevon etal. 2001).

Nakaya et al (Nakaya et al. 2000) investigate the effect of multipleseparate markers, each one thought to correspond to one gene, onquantitative phenotypes. Their work is a generalization of the LOD scoreto multiple loci, and it does not handle haplotype patterns.

An alternative approach for LD-based mapping is linkage analysis. Theidea is to analyze family trees, and to find out which markers tend tobe inherited to offspring in conjunction with the disease. Linkageanalysis does not rely on common founders, so in that respect it is morewidely applicable than LD-based methods. The downside is that estimatesare rough (due to the smaller effective number of meiosis sampled), andthat collecting information from larger families is more difficult andexpensive.

Transmission/disequilibrium tests (TDT) (Spielman et al. 1993) are anestablished way of testing association and linkage in a sample wherelinkage disequilibrium exists between the mutation locus and nearbymarker loci. TDT detects deviations between observed and expected countsfor each allele transmitted from heterozygous parents to affectedoffspring.

Single permutation tests have been used in mapping studies before(Churchill and Doerge 1994, Laitinen et al. 1997, Long and Langley1999). However, if more complex data is to be analyzed, these singlepermutation tests are too expensive and computationally very ineffectiveand even inoperative.

Genetic markers provide an economical, sparse view of chromosomes. Evensparsely located markers can be very informative: given an ancestor witha disease gene, the descendants that inherit the gene are also likely toinherit a string of alleles of nearby markers. The exact probability ofinheriting any combination of markers depends on the gene location withrespect to the markers, the population history or the coalescencehistory, and marker mutations; all of these are unknown. There is acontinuous need for more effective gene mapping methods.

The object of the present invention is to provide a novel method forgene mapping from chromosome and phenotype data. The method according tothe invention considers the recombination histories—sort of familytrees—that are likely to have caused the observed trees of patterns. Thedisease susceptibility (DS) gene is then predicted to be where thestrongest genetic contribution is visible in the trees. Thecontributions of the method according to the invention are:

-   (1) a novel approach to gene mapping using tree patterns,-   (2) an efficient algorithm for generating and testing tree patterns,-   (3) a method for the estimation of statistical significance of    individual findings as well as the whole process, based on multiple    permutations but carried out at the cost of a single permutation.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a method for genemapping from chromosome and phenotype data, which utilizes linkagedisequilibrium between genetic markers m_(i), which are polymorphicnucleic acid or protein sequences or strings of single-nucleotidepolymorphisms deriving from a chromosomal region. The method of theinvention comprises steps of

-   -   i) identifying a prefix tree T based on the observed haplotypes        at a number of locations of a chromosome,    -   ii) evaluating each prefix tree T by its genetic and statistical        feasibility, assuming that the gene was close to the root of the        tree, and thus determining a score for each prefix tree T,    -   iii) predicting the area for the location of the gene as a        function of the score determined in the step (ii).

The present invention is now explained in detail by referring to theattached figures and examples. These examples are only used to show someof the embodiments and are not intended to limit the scope of theinvention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1. A marker map of ten markers and a sample haplotype consisting ofalleles in adjacent markers.

FIG. 2. A carrier of the ancestral mutation has inherited founderalleles around the disease locus. These alleles are similar to those ofthe ancestral chromosome in generation 0. Due to the common inheritedsegment, many of the contemporary mutation carriers are expected toshare alleles in the markers around the mutation, but the length of theshared haplotype varies.

FIG. 3. A possible coalescence tree at the fourth marker for the threeobserved haplotypes at the bottom level. Internal nodes correspond torecurrent substrings. An alternative coalescence tree would have ---344-instead of -1234-- at the second level.

FIG. 4. An illustration of the tree structure in a string-sorted set ofhaplotypes to the right from the location pointed by the arrow.

FIG. 5. Analysis of the performance of TreeDT. A: Gene localizationpower with different values of A, the proportion of disease-associatedchromosomes that actually carry the mutation. B: Gene localization powerwith different numbers of subtrees (method parameter) and differentnumbers of founders (population parameter). C: Classification accuracyfor the existence of a disease susceptibility gene.

FIG. 6. Comparison of the gene localization performance of TreeDT, HPM,multipoint TDT (m-TDT), and TDT. A: The baseline test setting. B: Thebaseline setting with three founders. C: The baseline setting with 5%missing data.

DETAILED DESCRIPTION OF THE INVENTION

It is an object of the present invention to provide a method for genemapping aiming to discover a gene region affecting a certain trait usingchromosome data.

Empirical evaluation on a realistic, simulated data shows that themethod according to the invention is competitive with other recent datamining based methods, and clearly outperforms more traditional methods.Our experiments, explained later, show that the method according to theinvention, TreeDT, is effective in extreme conditions typical forcurrent mapping problems: with lots of noise (only 10-20% of affectedchromosomes carry the mutation, lots of missing data) and with smallsample sizes (200 affected and 200 control chromosomes). However, thehighest potential of the method according to the invention lies in thedata intensive tasks of future—such as genome scanning with largersamples and larger number of markers—due to its low computationalcomplexity.

In comparison to state of the art methods, TreeDT is most competitive.In terms of gene localization accuracy, it gave best results in the caseof multiple founders and demonstrated good robustness with respect tomissing data. Unlike the compared methods, TreeDT can be used to predictwhether a gene is present at all or not. Finally, in comparison to itsclosest competitor, HPM, TreeDT has much smaller computational cost. Anadditional advantage of TreeDT is that it has only one input parameter,the (maximum) number of deviant subtrees, whereas for HPM one has to setseveral more or less arbitrary thresholds.

Method

For any pair of chromosomes in the sample there has been a common originin the population history, an ancestral chromosome at which their pathshave diverged. Due to recombinations different parts of chromosomes havedifferent histories. At any given location the chromosomes in the sampleand their most recent common origins form a coalescence tree. In thecoalescence tree for the DS gene location, all the chromosomes in one ormore subtrees carry the DS mutation, and we should observe excess ofdisease-associated haplotypes as the leaves of these subtrees. Thecloser the tree is located to the DS gene, the more and larger subtreesare identical to those in the tree at the DS gene location.

Based on the observed haplotypes, the method of the invention defines aprefix tree estimating the most likely coalescence tree at a number oflocations along the analyzed chromosome, and then assesses the subtreeclustering of disease-associated haplotypes in these trees.

It is a further object of this invention to provide a novel treedisequilibrium test, intended for predicting DS gene locations in themethod of the invention. The vicinity of the location for which the testgives the lowest p value is the most likely candidate area for the DSgene location. The method also calculates the corrected overall p valuefor the best finding. This p value can be used for predicting whetherthe chromosome carries a DS gene.

Further, a method for the estimation of statistical significance ofindividual findings as well as the whole process, based on multiplepermutations but carried out at the cost of a single permutation, isprovided.

Haplotype Prefix Trees

The subsumption relation of the substrings overlapping a given locationforms a directed acyclic graph (DAG). The tree structures obtainable bypruning the DAG may be considered as possible coalescence trees at thelocation, as shown in FIG. 3, with the following exceptions: 1) Theorder of nodes may differ from that in the true coalescence tree, e.g.-34-- might actually be a more recent node than --1234--. However,because the expected length of the shared region of two chromosomesdecreases monotonically as the time from their divergence increases, itis easy to see that the order given by subsumption is the most likelyone. 2) Because haplotypes may also share a substring by chance, theinternal nodes may represent a combination of nodes in true coalescencetree. The upper nodes of the coalescence tree must be very old and thecorresponding shared chromosomal regions extremely short, and thereforeit is very likely that a large number of coalescence nodes is containedin the empty substring root. On the other hand the younger coalescencenodes with shared regions spanning over several markers are more likelyto have one-to-one correspondence with observed recurrent substrings.

Instead of considering alternative coalescence trees leading to the sameobserved haplotypes, the method of the invention uses the uniquehaplotype prefix tree as a canonical representation of such set ofcoalescence trees. An example of a prefix tree is shown in FIG. 4. Themethod of the invention builds the prefix trees between each pair ofconsecutive markers and tests their disequilibrium.

Tree Disequilibrium Test

According to one embodiment of the invention, the prefix tree T istested by the tree disequilibrium test (TreeDT) testing the alternativehypothesis The distribution of the disease-association statuses deviatesin some subtrees of T from the overall distribution of statuses againstthe null hypothesis The disease-association statuses are randomlydistributed in the leaves of T. TreeDT identifies the subtree set inwhich the observed status distribution deviates most from theexpectation under the null hypothesis, and returns the significance ofthe deviation as a p value. TreeDT takes the maximum number of deviantsubtrees as a parameter. In principle there is no need to set an upperlimit for the subtree count, but whenever LD-mapping is applicable, themajority of the mutation carriers is concentrated in a only few suchsubtrees in which the shared region is long enough to identify a deviantsubstring. In the experiments for this paper we use an upper limit of 6subtrees.

For measuring the disequilibrium of a tree, we use a variant of the Ztest. The test statistic Z_(k) for a tree with k deviant subtrees T₁, .. . , T_(k) is${Z_{k} = {\sum\limits_{i = 1}^{k}\frac{a_{i} - {n_{i}p}}{\sqrt{n_{i}{p\left( {1 - p} \right)}}}}},$where a_(i) is the number of disease-associated haplotypes and n_(i) thetotal number of haplotypes in subtree T_(i)εS, and p is the proportionof disease-associated haplotypes in the sample. The score measures thedistance of the observed number of disease-associated chromosomes(a_(i)) from the expectation (n_(i)p) in standard deviations (the squareroot of n_(i)p(1−p)), under the assumption of binomial distribution withparameters n_(i) and p. We use a one-tailed test, since we areinterested only in subtrees in which the proportion ofdisease-associated haplotypes is greater than expected.

We could use a 2×(k+1) χ²-statistic as a measure of deviation for agiven subtree set S. The χ²-statistic, however, is not easily maximizedin the space of all possible subtree sets and is therefore not a verypractical choice.

Z_(k) can be efficiently maximized simultaneously for all k using arecursive algorithm, as shown in the Algorithms section.

TreeDT takes the maximum number of deviant subtrees as a parameter. Inprinciple there is no need to set an upper limit for the subtree count,but whenever LD-mapping is applicable, the majority of the mutationcarriers is concentrated in only few such subtrees in which the sharedregion is long enough to identify a deviant substring. In theexperiments for this paper we use an upper limit of 6 subtrees.

Significance Tests with Nested Permutations

Z_(k) is a measure for the disequilibrium of a given tree, correspondingto a certain location in the chromosome, with given k deviant subtrees.Given a tree, TreeDT finds for each k the set S of subtrees thatmaximizes Z_(k). In order to find the best k for the given tree, simplemaximization is not possible. Since the statistics for different degreesof freedom k are not comparable, TreeDT estimates the p value for eachmaximized Z_(k) (under the null hypothesis of random distribution ofdisease status). Because the distribution of the maximized Z_(k) is verycomplex and dependent on the tree structure, p values are estimated by apermutation test.

In order to get a single p value for the disequilibrium at a givenlocation, we need to combine the information from the trees to the leftand to the right of the location. As a combined measure we use theproduct of the lowest p value over all k from each side. Again, sincethe measures are not necessarily directly comparable, a new p value forthe combination is estimated. The results are now comparable betweendifferent locations.

The output of TreeDT is essentially the p value ranked list oflocations. A point prediction for the gene location is obtained bytaking the best location; a (potentially fragmented) region of length lis obtained by taking best locations until a length of l is covered.

Since multiple locations are tested for a p value, and also since the pvalues at nearby locations are not independent, a direct link betweenthe p value and the probability that the gene is indeed close to thelocation can not be established. The p values are used simply as amethod of ranking the locations.

However, a single corrected p value for the best finding can be obtainedwith a third test using the lowest local p value as the test statistic.This p value can also be used to answer the question whether there is agene in the investigated are in the first place or not.

All these three nested p value tests (for each tree and k, for eachlocation, for the best location) can be carried out efficiently at thecost of a single test. Table 1 summarizes the three levels of the nestedtest. TABLE 1 A summary of the permutation test procedure. For LevelEach Test statistic Result 1 (T, k) max Z_(k)(S, T) over all p(T, k) S ∈SubtreeSets(T) 2 t min p(T₁, k₁) p(T₂, k₂) p(t), the p value of the treedisequi- over all k₁, k₂ librium test for the pair t = (T₁, T₂) of left-and right-side trees rooted at the same location 3 min p(t) over all tp, the corrected overall p valueAlgorithmsConstructing the Haplotype Prefix-Trees

The haplotype prefix-trees to the left and right from each analyzedlocation can be efficiently identified using a string-sorting algorithm.The algorithm produces as intermediate results for each marker thesorted list of the partial haplotypes to the right from the marker. Allthe right-side trees can be easily derived from these intermediatelists, because the haplotypes belonging to a single node form acontinuous block in the sorted list. The left-side trees can beidentified similarly by sorting the inverted haplotypes. Thecomputational cost of constructing the trees is negligible compared tothe cost of the permutation test procedure.

The same process can also be used to enumerate all the recurrentsubstrings, or all the closed substrings. A substring s is closed, ifand only if none of its superstrings match all the same haplotypes thans. The nodes in the right-side prefix trees have one-to-onecorrespondence to recurrent substrings starting at the same marker.Nodes that are to be split in the next step of the sort algorithmcorrespond to closed substrings.

An Algorithm for Maximizing the Tree Disequilibrium Statistic

It is essential that the time-complexity of the algorithm for maximizingthe Z-values is as low as possible, because it must be executed for eachtree location and permutation in turn. The key observation is that if Sis the set of k deviant subtrees of T with the greatest value of Z_(k),T′ is a subtree of T, and S′⊂S is a set of m subtrees in T′, then S′ hasthe maximum value of Z_(m) in T′. Also, if S=S₁∪ . . . ∪S_(n), and k isthe subtree count in S, and k_(i) is the subtree count in S_(i), then${Z_{k}(S)} = {\sum\limits_{i}^{\quad}{{Z_{k_{i}}\left( S_{i} \right)}.}}$

These observations lead us to the following recursive algorithm thatpropagates the locally maximized Z-values upwards in the tree:

-   Input: A haplotype prefix tree T-   Output: Maximum values of Z_(k) in the tree T for each k-   Call Maximize(T)-   Maximize(T):-   If T is not a leaf:-   1. For each immediate subtree T_(i) of T: Recursively call    Maximize(T_(i)).-   2. For each k: calculate the maximum value Z_(MAX, k)(T) for    Z_(k)(S,T) over all S that can be obtained by combining subtree sets    from each subtree T_(i) of T.-   3. Calculate Z₁ for T. If Z₁>Z_(MAX, 1)(T) then set Z_(MAX, 1)(T):    =Z₁.-   If T is a leaf, then set Z_(MAX, 1)(T): =0.-   Step 2 can be further refined:-   2.1 Set Y_(k): =0 and Z_(MAX, k)(T): =0 for all k, 1≦k≦n, where n is    the number of leaves in T.-   2.2 For each subtree T′ of T:-   2.2.1 For each pair (i,j), 1≦i≦p and 1≦j≦q, where p is the number of    leaves in T′ and q is the total number of leaves in all the subtrees    processed prior to T′:    -   If Z_(MAX, i)(T′)+Y_(j)>Z_(MAX, i+j)(T), then set        Z_(MAX, i+j)(T): =Z_(MAX, i)(T′)+Y_(j).-   2.2.2 For each k, 1≦k≦p:    -   If Z_(MAX, k)(T′)>Z_(MAX, k)(T), then set Z_(MAX, k)(T):        =Z_(MAX, k)(T′).-   2.2.3 For each k, 1≦k≦p+q:    -   If Z_(MAX, k)(T)>Y_(k)(T), then set Y_(k)(T)=Z_(MAX, k)(T)

The time complexity of the algorithm is O(n²) (proof omitted), where nis the number of leaves in the tree i.e. the number of haplotypes in thedata set. By setting an upper limit k for the size of the subtree sets,the average time complexity can be reduced to O(n) with a constantcoefficient proportional to k², k being typically small, ≦10.

An Efficient Algorithm for Multiple Nested Permutation Tests

The straightforward algorithm for a three-level nested permutation testusing nested loops would have time complexity of O(n³qr), where n is thenumber of permutations at each level, q is the time complexity ofmaximizing the Z_(k)-statistic for all k, and r is the number of testedlocations in the chromosome. The test would be intractable already withrather small permutation counts. However, the time complexity can bedrastically reduced using the same set of permutations at each level ofthe test and thus only maximizing the Z_(k)-values n instead of n³ timesfor each location.

-   1. Compute Z_(MAX, k)(T)=max Z_(k)(T,S) for each subtree count k and    each coalescence tree T over all SεSubtreeSets(T).-   2. Randomly generate n+1 permutations of disease-association    statuses for the haplotypes and for each permutation i and (T,k):    compute Z_(MAX, k)(i,T)=max Z_(k)(i,T,S) over all SεSubtreeSets(T).    //Level 1-   3. For each (T,k):-   3.1 Calculate a p value p(T,k) by comparing Z_(MAX, k)(T) to    Z_(MAX, k)(i,T), 1≦i≦n.-   3.2 For each permutation i: calculate a p value p(i,T,k) by    comparing Z_(MAX, k)(i,T) to all Z_(MAX, k)(j,T), j≠i.    //Level 2-   4. For each pair of opposed trees rooted at the same location    t=(T₁,T₂):-   4.1 Choose p_(MIN)(t)=min p(T₁,k₁)p(T₂,k₂) over all k₁, k₂-   4.2 For each permutation i: choose p_(MIN)(i,t)=min    p(i,T₁,k₁)p(i,T₂,k₂) over all k₁, k₂.-   4.3 Calculate a p value p(t) by comparing p_(MIN)(t) to    p_(MIN)(i,t), 1≦i≦n.-   4.4 For each permutation i: calculate a p value p(i,t) by comparing    p_(MIN)(i,t) to all p_(MIN)(i,t), j≠i.    //Level 3-   5. Choose p_(MIN)=min p(t) over all t.-   6. For each permutation i: choose p_(MIN)(i)=min p(i,t) over all t.-   7. Calculate the overall corrected p value by comparing p_(MIN) to    p_(MIN)(i), 1≦i≦n.

The time complexity of steps 3.2 and 4.4 is O(n log n) using analgorithm which first sorts the values of the test statistic for all thepermutations. Step 2 predominates the time complexity of the algorithm,O(nqr), where s is the upper limit for the number of subtrees allowed ina set, q is the time complexity of maximizing the Z_(k)-statistic forall k, and r is the number of tested locations in the chromosome.

Due to the finite number of permutations, the precision of the p valuesgiven by a permutation tests may not be sufficient for accuratelocalization. In some situations even a very large number ofpermutations does not produce any values for the test statistic moreextreme than the observed values for several consecutive tree locations.For this purpose the p values returned by the first and second levelpermutation tests are determined slightly unconventionally: At level 1we use a slightly modified version of algorithm 2 to obtain an upperbound of Z_(k) for all k. At level 2 the smallest possible value for thetest statistic is zero. These values correspond to p values of 1/2(n+1).The returned p value is interpolated between the p values correspondingto the next lower and higher values for the test statistic obtained bypermutations. The top-level test returning the overall p value isimplemented in the usual conservative manner.

EXAMPLES

Certain embodiments and results of the present invention are describedin the following non-limiting examples.

We compare TreeDT empirically to TDT, an established mapping method, andto HPM, our recent proposal based on pattern discovery. We evaluate themethods on a difficult data collection carefully simulated to resemble arealistic population isolate.

Example 1 Simulation of Data

We designed several different test settings, with variation in thefraction (A) of mutation carriers in the disease-associated chromosomes,in the number of founders who introduced the mutation to the population,and in the amount of missing information. For statistical analyses, wecreated 100 independent artificial data sets in each test setting. Greatcare was taken to generate realistic data by a simulation procedure thatincluded four steps: pedigree generation, simulation of inheritance,diagnosing, and sampling.

The population pedigree was set to grow from 100 to 100,000 individualsin a period of 20 generations. In each generation, the selection ofparents for each child was random, but once a couple was formed, allsubsequent children allocated to either of the parents were set to becommon children of the couple.

The inheritance of chromosomes within the population pedigree wassimulated by first allocating a continuous chromosomal segment of 100centiMorgans to each founder individual in generation 1.

Morgan is a unit of genetic length. 1 cM is the distance at whichrecombination occurs 1 out of every 100 times, about 10⁶ base pairs.Human chromosomes are roughly of 50-300 cM.

Next, the entire pedigree was traversed top-down, and, in eachinheritance event, gametes were created by simulating meiosis under theassumption that the number of chiasmata in the pair of homologouschromosomes was taken from Poisson distribution with parameter one(corresponding to the genetic length of 100 cM), and their locationsselected randomly. A related approach was originally presented in(Terwilliger et al., 1993).

For a baseline test setting we selected a challenging disease modelwhere only a small proportion (A=10%) of the disease-associatedchromosomes carries the disease-predisposing mutation, a complicationthat often is encountered in the analysis of common diseases. In thebaseline setting there is one founder, and on average 3.7% of allelesare missing, making the mapping task more difficult but also morerealistic.

The location of the mutation was selected randomly and independently foreach of the 100 data sets produced in every setting. Each data set wasin turn collected from 100 affected individuals. The length of theregion to be analyzed was 100 cM. Allelic data were created using a mapof 101 equidistantly spaced markers, each having 5 alleles. Bothchromosomes of each affected individual in each sample were labeleddisease-associated whereas the control chromosomes were constructed fromthe non-transmitted alleles in the parental chromosomes. Each data setthus consisted of 200 disease-associated and 200 control chromosomes

Example 2 Analysis of TreeDT

First we assess the prediction accuracy of TreeDT with different valuesof A, the proportion of disease-associated chromosomes that actuallycarry the mutation (FIG. 5A). The results are reported as curves thatshow the percentage of 100 data sets where the gene is within thepredicted region, as a function of the length of the predicted region.Or, in other words, the x coordinate tells the cost a geneticist iswilling to pay, in terms of the length of the region to be furtheranalyzed, and the y coordinate gives the probability that the gene iswithin the region. For A=20% or 15% the accuracy is very good, and withlower values of A the accuracy decreases until with A=5% only in 20-30%of data sets can the gene be localized within a reasonable accuracy of10-20 cM. We remind the reader that the test settings have been designedto be challenging, and to test the limits of the approach.

Next we evaluate the effect of the only parameter of TreeDT, the numberof deviant subtrees that are searched for in each tree. An upper limitof 6 subtrees, used in the previous test, is evaluated against fixedamounts of 1, 2, or 3 subtrees, with a varying number of founders thatintroduced the mutation (FIG. 5B). As we increase the number offounders, evidence about the gene location becomes more fragmented, andaccordingly the performace degrades. While the differences betweendifferent numbers of subtrees are not large, it is interesting to notethat for each number of founders, the same number of subtrees givesmarginally the best result. The upper limit of 6 subtrees givesconsistently competitive results, so we continue using it in thefollowing experiments.

Gene mapping studies like the ones imitated in the above tests assume,based on some other analyses, that a disease susceptibility gene isindeed present in the analyzed area. TreeDT has the important advantageover plain gene localization methods that it can also be used to predictwhether the analyzed region contains a disease susceptibility gene atall or not. The overall p value TreeDT produces indicates the correctedsignificance of the best single finding, and by setting an upper limitfor its value TreeDT can be used to classify data sets to ones that door do not contain a gene. For data sets with no gene, TreeDT correctlyproduces overall p values that are uniformly distributed in [0,1]. So,smaller thresholds for p result in less false positives, but also inless true positives. FIG. 5C shows the experimental relationshipsbetween power (ratio true positivites/all positives) and overall p(ratio false positives/all negatives). For higher values of A theclassification accuracy is extremely good. For A=5% it is comparable torandom guessing, although TreeDT is still able to locate an existinggene adequately in 20-30% of the cases (FIG. 5A).

Example 3 Comparison to Other Methods

TreeDT, HPM, and m-TDT have practically identical performance inlocalizing the DS gene in the baseline setting (FIG. 6A). TDT is clearlyinferior compared to the other methods. Tests with other values of Agive similar results.

In a test setting with three founders who introduced the mutation to thepopulation, differences between the three best methods start to appear(FIG. 6B). TreeDT has an edge over HPM, which in turn has an edge overm-TDT. TDT barely beats random guessing.

Finally, we compare the methods with a large amount of missing data(FIG. 6C). Expectedly, HPM is most robust with respect to missing datasince it allows gaps in its haplotype patterns. Surprisingly, TreeDT isnot much weaker than HPM, although no actions have been taken in it toaccount for missing or erroneus data. Performance of m-TDT degrades muchmore clearly.

Method to method comparisons (not shown) indicate that the predictionerrors are mostly caused by random effects in population history—sincedifferent methods tend to make mistakes in the same data sets—ratherthan by systematic differences between the methods. However, those caseswhere one method succeeds and another fails will give useful input forfurther improvements of the methods.

The execution time of TreeDT for a single dataset is about ten minutesusing 1,000 permutations on a 450 MHz Pentium II. The respective timefor HPM with permutations is over 20 minutes.

REFERENCES

-   [1] R. Agrawal, T. Imielinski, and A. Swami. Mining association    rules between sets of items in large databases. In P. Buneman and S.    Jajodia, editors, Proceedings of 1993 ACM SIGMOD Conference on    Management of Data, pp 207-216. ACM, Washington, D.C., May 1993.-   [2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo.    Fast Discovery of Association Rules. In U. Fayyad, G.    Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in    Knowledge Discovery and Data Mining, pp 307-328. AAAI Press, Menlo    Park, Calif., 1996.-   [3] B. Devlin, N. Risch, and K. Roeder. Disequilibrium Mapping:    Composite Likelihood for Pairwise Disequilibrium. Genomics, 36:1-16,    1996.-   [4] L. Kruglyak, M. Daly, M. Reeve-Daly, E. Lander. Parametric and    Nonparametric Linkage Analysis: a Unified Multipoint Approach. Am J    Hum Genet, 58:1347-1363, 1996.-   [5] L. Lazzeroni. Linkage Disequilibrium and Gene Mapping: an    Empirical Least-Squares Approach. Am J Hum Genet, 62:159-170, 1998.-   [6] M. McPeek and A. Strahs. Assessment of Linkage Disequilibrium by    the Decay of Haplotype Sharing, with Application to Fine-scale    Genetic Mapping. Am J Hum Genet, 65:858-875, 1999.-   [7] A. Nakaya, H. Hishigaki, and S. Morishita. Mining the    Quantitative Trait Loci Associated with Oral Glucose Tolerance in    the Oletf Rat. Proc. of Pacific Symposium on Biocomputing, pp    367-379, Jan. 4-9, 2000.-   [8] S. Service, D. Temple Lang, N. Freimer, and L. Sandkuijl.    Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of    Ancestral Haplotypes in Founder Populations. Am J Hum Genet,    64:1728-1738, 1999.-   [9] P. Sevon, V. Ollikainen, P. Onkamo, H. Toivonen, H. Mannila,    and J. Kere. Mining Associations Between Genetic Markers, Phenotypes    and Covariates. Genetic Analysis Workshop 12, Genetic Epidemiology,    21 (Suppl. 1), 2001. In press.-   [10] P. Sevon, H. Toivonen, V. Ollikainen. TreeDT: gene mapping by    tree disequilibrium test (extended version). Report C-2001-32,    Department of Computer Science, University of Helsinki, Finland,    2001.-   [11] R. Spielman, R. McGinnis, W. Ewens. Transmission Test for    Linkage Disequilibrium: The Insulin Gene Region and    Insulin-Dependent Diabetes Mellitus (IDDM). Am J Hum Genet,    52:506-516, 1993.-   [12] J. Terwilliger, M. Speer, J. Ott. Chromosome-Based Method for    Rapid Computer Simulation in Human Genetic Linkage Analysis. Genetic    Epidemiology, 10:217-224, 1993.-   [13]J. Terwilliger. A Powerful Likelihood Method for the Analysis of    Linkage Disequilibrium Between Trait Loci and One ore More    Polymorphic Marker Loci. Am J Hum Genet, 56:777-787, 1995.-   [14] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H.    Mannila, M. Herr, and J. Kere. Data Mining Applied to Linkage    Disequilibrium Mapping. Am J Hum Genet, 67:133-145, 2000.-   [15] H. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H.    Mannila, and J. Kere. Gene Mapping by Haplotype Pattern Mining.    Proc. Bio-Informatics and Biomedical Engineering, pp 99-108,    Arlington, Va., Nov. 8-10, 2000.

1. A method for gene mapping to discover a gene or DNA region affectinga certain trait using chromosome and phenotype data, which methodutilizes linkage disequilibrium between genetic markers m_(i), which arepolymorphic nucleic acid or protein sequences or strings ofsingle-nucleotide polymorphisms deriving from a chromosomal region,which method comprises following steps: i) identifying a prefix tree Tbased on the observed haplotypes at a number of locations of achromosome, ii) evaluating each prefix tree T by its genetic andstatistical feasibility, assuming that the gene was close to the root ofthe tree, and thus determining a score for each prefix tree T, iii)predicting the area for the location of the gene as a function of thescore determined in the step (ii).
 2. The method according to claim 1,wherein in the step (i) the prefix tree T is build between each pair ofconsecutive markers.
 3. The method according to claim 1 or 2, whereinthe prefix tree T is build using a string-sorting algorithm.
 4. A methodaccording to claim 1, wherein the prefix tree T is evaluated by treedisequilibrium test testing the alternative hypothesis The distributionof the disease-association statuses deviates in some subtrees of T fromthe overall distribution of statuses against the null hypothesis Thedisease-association statuses are randomly distributed in the leaves ofT.
 5. A method according to claim 4, wherein for measuring thedisequilibrium of a tree a test statistic Z_(k) for a tree with kdeviant subtrees T₁, . . . , T_(k) is calculated by the followingformula:${Z_{k} = {\sum\limits_{i = 1}^{k}\frac{a_{i} - {n_{i}p}}{\sqrt{n_{i}{p\left( {1 - p} \right)}}}}},$where a_(i) is the number of disease-associated haplotypes and n_(i) thetotal number of haplotypes in subtree T_(i)εS, S being the given subtreeset, and p is the proportion of disease-associated haplotypes in thesample.
 6. A method according to claim 4 or 5, wherein the followingalgorithm is used: Input: A haplotype prefix tree T Output: Maximumvalues of Z_(k) in the tree T for each k Call Maximize(T) Maximize(T):If T is not a leaf:
 1. For each immediate subtree T_(i) of T:Recursively call Maximize(T_(i)).
 2. For each k: calculate the maximumvalue Z_(MAX, k)(T) for Z_(k)(S,T) over all S that can be obtained bycombining subtree sets from each subtree T_(i) of T.
 3. Calculate Z₁ forT. If Z₁>Z_(MAX, 1)(T) then set Z_(MAX, 1)(T): =Z₁. If T is a leaf, thenset Z_(MAX, 1)(T): =0.
 7. A method according to claim 6, wherein step 2is further refined as follows: 2.1 Set Y_(k): =0 and Z_(MAX, k)(T): =0for all k, 1≦k≦n, where n is the number of leaves in T. 2.2 For eachsubtree T′ of T: 2.2.1 For each pair (i,j), 1≦i≦p and 1≦j≦q, where p isthe number of leaves in T′ and q is the total number of leaves in allthe subtrees processed prior to T′: IfZ_(MAX, i)(T)+Y_(j)>Z_(MAX, i+j)(T), then set Z_(MAX, i+j)(T):=Z_(MAX, i)(T′)+Y_(j). 2.2.2 For each k, 1≦k≦p: IfZ_(MAX, k)(T)>Z_(MAX, k)(T), then set Z_(MAX, k)(T): =Z_(MAX, k)(T′).2.2.3 For each k, 1≦k≦p+q: If Z_(MAX, k)(T)>Y_(k)(T), then set Y_(k)(T):=Z_(MAX, k)(T)
 8. A method according to any of claims 4 to 7 wherein thesignificance of the disequilibrium at a given location is tested bymultiple nested permutation test.
 9. A method according to claim 8,wherein the permutation test comprises following steps: finding for eachk the set S of subtrees that maximizes Z_(k) and estimating the p valuefor each maximized Z_(k) estimating a new p value for a combination ofthe information from the prefix tree T to the left and to the right ofthe location, combined measure being the product of the lowest p valueover all k, and ranking locations by the new p values, obtaining thepoint prediction for the gene location by taking the best location fromthe p value ranked list of locations and obtaining a single corrected pvalue for the best finding with a test using the lowest local p value asthe test statistic.
 10. A method according to claim 9, wherein thefollowing algorithm is used:
 1. Compute Z_(MAX, k)(T)=max Z_(k)(T,S) foreach subtree count k and each coalescence tree T over allSεSubtreeSets(T).
 2. Randomly generate n+1 permutations ofdisease-association statuses for the haplotypes and for each permutationi and (T,k): compute Z_(MAX, k)(i,T)=max Z_(k)(i,T,S) over allSεSubtreeSets(T). //Level 1
 3. For each (T,k): 3.1 Calculate a p valuep(T,k) by comparing Z_(MAX, k)(T) to Z_(MAX, k)(i,T), 1≦i≦n. 3.2 Foreach permutation i: calculate a p value p(i,T,k) by comparingZ_(MAX, k)(i,T) to all Z_(MAX, k)(j,T), j≠i. //Level 2
 4. For each pairof opposed trees rooted at the same location t=(T₁,T₂): 4.1 Choosep_(MIN)(t)=min p(T₁,k₁)p(T₂,k₂) over all k₁, k₂ 4.2 For each permutationi: choose p_(MIN)(i,t)=min p(i,T₁,k₁)p(i,T₂,k₂) over all k₁, k₂. 4.3Calculate a p value p(t) by comparing p_(MIN)(t) to p_(MIN)(i,t), 1≦i≦n.4.4 For each permutation i: calculate a p value p(i,t) by comparingp_(MIN)(i,t) to all p_(MIN)(j,t), j≠i. //Level 3
 5. Choose p_(MIN)=minp(i,t) over all t.
 6. For each permutation i: choose p_(MIN)(i)=minp(i,t) over all t.
 7. Calculate the overall corrected p value bycomparing p_(MIN) to p_(MIN)(i), 1≦i≦n.
 11. A computer-readable datastorage medium having computer-executable program code stored thereonoperative to perform a method of any of preceding claims when executedon a computer.
 12. A computer system programmed to perform the method ofany of claims 1-10.